1. Scope

1.1  Identification 

This document is the Final Report for contract No. F04704-89-C-0044
entitled Embedded Fault-Tolerant Computer for Mission-Critical
Applications, awarded to Fail-Safe Technology Corporation (FST) by the
Ballistic Missile Office (BMO) under the Air Force Systems Command (AFSC)
as a Phase II Small Business Innovation Research (SBIR) contract.

1.2  Purpose 

The objective of this effort is to develop an Embedded Fault-Tolerant
Computer (EFTC) using a combination of off-the-shelf and custom hardware
and software components. This prototype will then be analyzed in order
to determine the feasibility of the concept.


1.3  Introduction 

This report summarizes the main activities and results of the EFTC
development as it has progressed through Phase II. It describes: the
development of the EFTC system from evolution through hardware and software
design (Section 2); brassboard implementation of hardware and software
(Section 3); and the successful demonstration of generic capabilities as well
as applications (Section 4). In addition, summaries of various ways in
which the system can be enhanced illustrate the flexibility of the design
(Section 5).
2. System Development

In this section we describe the way in which the EFTC system was
developed in terms of the underlying software processes and the hardware
architecture.

2.1  Evolution 

The first task of this SBIR effort was to determine more detailed
requirements for the EFTC. We started by discussing existing and future
projects with Norton BMO personnel. After a few briefings to the Rail Garrison
and Small Missile project offices at Norton, we were directed to vendors
of services and systems to Norton. We talked to Rockwell, TRW, GTE,
and Ford Aerospace about the requirements for fault-tolerant computers in
ICBM systems.

The conclusion was that no great needs could be identified for
fault-tolerance in the missiles because of short mission duration and existing
safety systems. Also, no new ICBM designs are currently funded or expected in
the near future.

Possible requirements for fault-tolerant computers were identified for the
ICBM ground support equipment (GSE), test equipment, and command,
control and communication (C³) systems. The most common need we found
for fault-tolerant computers was for communication control computers
using a PC-class workstation. Solutions not requiring a fault-tolerant
computer had been developed for the existing equipment, but a fault-tolerant
computer would have been used if it were available.

In looking at the needs of future programs, no new equipment designs
were identified. Programs like Rail Garrison and small ICBMs were based
on existing designs. The conclusion of this study was that there was a
need for fault-tolerant PC-class computers in new ICBM GSE, C³, and test
equipment, but no new equipment was planned for the foreseeable future.

Personnel at BMO suggested we talk to AFSC for possible applications.
The Advanced Program Office at AFSC indicated that few new satellite
programs were planned, and existing programs had all developed solutions
to meet their reliability requirements. The need for fault-tolerant
computers in new programs was primarily in the GSE. In looking at the typical
requirements for GSE, a PC-class workstation using Intel 386 or 486
processors would meet most requirements.

In addition to the requirements of BMO and AFSC, FST looked at other
government and commercial application customers. We determined that
there was a need for a fault-tolerant computer at a lower price than the
existing commercial fault-tolerant computers such as Tandem and Stratus.
In addition, users wanted hardware that would run low-cost software
developed for the IBM PC-compatible hardware platforms.

In view of those discussions, the decision was made in conjunction with
BMO to develop the EFTC as a fault-tolerant IBM PC-compatible platform
that functions as a network-based file server.

2.2  Concept  and  Approach 

Developing a fault-tolerant computer is not an easy task. The design is
constrained by many factors that result in several tradeoffs while
developing and meeting a set of requirements. In order to make this task
tractable, we have approached it by using a well-established and systematic
design paradigm for fault-tolerant systems; the next section outlines the
steps in this paradigm.

2.2.1  Fault-tolerance  Design  Paradigm 

Given a set of system requirements that define the services to be delivered
and the service boundaries at which service delivery will take place, the key
steps of the paradigm are as follows:

1. The dependability goals of the system are specified in terms of
reliability, availability, maintainability, safety, and so forth. This requires
three steps.

(a) First, the classes of hardware and software faults that are to
be tolerated over the life of the system are explicitly identified.
Fault classes are chosen such that faults that elicit the same error
syndrome are grouped into the same class, thereby reducing the
scope of the effort.

(b) Second, quantitative goals for the dependability of system
services are specified.

(c) Third, the methods for evaluating the dependability actually
attained by the system are specified in detail.

2. The system is partitioned into subsystems (hardware, software,
communication, interfaces) for implementation, taking into account both
performance and fault-tolerance.

3. Error detection and fault diagnosis algorithms for every subsystem
are selected. The choices of error detection and fault diagnosis
techniques are guided by the dependability goals. We ascertain that all
relevant fault classes are detectable, and that the probability of timely
detection is adequate.

4. State recovery and fault removal techniques are devised that are
invoked by the fault signals from fault detection algorithms. Their goal
is to return the system to some level of proper operation or to shut
it down safely. Fault-signal-invoked recovery is classified into three
classes during the design process:

(a)  recovery  to  original  performance  (fail-operational); 

(b)  recovery  to  degraded  performance  (fail-soft); 

(c)  execution  of  safe  shutdown  (fail-safe). 

A fourth class of recovery algorithms that does not depend on a fault
signal, but maintains original performance by the use of concurrently
active protective redundancy, is also considered (masking).

5.  Subsystem  fault-tolerance  is  integrated  into  the  overall  system. 

6. An evaluation of the fault-tolerance of the design and its impact on
performance is then performed via a combination of analytic and
simulation modeling.

7. A refinement of the design is then carried out. If the initial
evaluation demonstrates that the various hardware and software subsystems
fail to meet the primary dependability specification, or if there are
unequal contributions to the overall system dependability, steps 2
through 6 are repeated. The goal of this refinement step is to balance
the protection provided to each subsystem so that the dependability
goal is achieved without a single dominating contributor to
nondependability, and at the lowest cost of additional resources.

2.3  System  Specification 

The specification of the EFTC hardware and software was accomplished by
using the design paradigm of Section 2.2.1 in conjunction with a structured
development methodology [HaP88]. The design paradigm defines the key
steps of the design and the structured development methodology provides
a stylized mechanism for developing and documenting each step. Our
approach is to first define fault-tolerance requirements for the system based
on its dependability goals, and then develop detailed specifications via a
System Specification Model (SSM).

2.3.1 Basic System Description

The following top-level system description is provided here as a basis for
better understanding the system development process and models that
follow.

The EFTC provides file-server (FS) functions for workstations in a
local-area network (LAN) environment. The baseline hardware architecture is
depicted in Figure 1. Workstations on the LAN may use the file server as an
information repository or to perform other network-wide services such as a
centralized printer. Workstations interface with the file-server by sending
file-server requests across the LAN, and receive responses in the form of
data or control messages from the file-server.

The file-server runs the Novell Netware operating system, and workstations
interface to it via Novell-compatible client modules that execute locally.

2.3.2  Fault-Tolerance  Requirements  Definition 

In this section we establish the fault-tolerance requirements for the EFTC
system.
2.3.2.1 Fault Set Definition. A comprehensive fault model of the
system was constructed. This model lists all faults that are to be mitigated and
their attributes. To limit the number of faults that need to be considered,
faults that generate the same error or failure response were grouped into
the same class, the objective being to “cover” as many faults as possible
with as few mechanisms as possible.

The basic fault set was constructed by assuming that the “service
boundary” encapsulates the entire system; the service boundary is an imaginary
interface at which service is delivered to a user. Users in this case are
workstations on the network, or an operator that interfaces directly with
the system via a system console. The faults in the set therefore account for
all possible faults within the system that can lead to a failure from a user’s
perspective. This resulted in the following list:

• Any permanent fault, or any transient or intermittent fault with a
duration greater than 5 seconds, that results in loss of system power.

• Any permanent fault that results in loss of the CPU.

• Any permanent fault, or any transient or intermittent fault with a
duration greater than 5 seconds, that results in partial or total loss
of the contents of the RAM.

• Any permanent fault, or any transient or intermittent fault with a
duration greater than 5 seconds, that results in loss of the ethernet
controller.

• Any permanent fault, or any transient or intermittent fault with a
duration greater than 5 seconds, that results in loss of the SCSI disk
controller.

• Any permanent fault, or any transient or intermittent fault with a
duration greater than 5 seconds, that results in loss of critical
information on a hard-disk that directly supports file-server operations.
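
The membership rule implied by the list above can be written down compactly. The following Python sketch is illustrative only: the names `Fault`, `Persistence`, `Resource`, and `in_basic_fault_set` are assumptions introduced here and are not part of the EFTC software. It encodes the rule that permanent faults are always in the set, that transient or intermittent faults are in the set only when they last longer than 5 seconds, and that only permanent faults are listed for loss of the CPU.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Persistence(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"


class Resource(Enum):
    POWER = "system power"
    CPU = "CPU"
    RAM = "RAM contents"
    ETHERNET = "ethernet controller"
    SCSI = "SCSI disk controller"
    DISK_DATA = "critical hard-disk information"


# Threshold taken from the fault list: shorter transients are not in the set.
MIN_DURATION_S = 5.0


@dataclass(frozen=True)
class Fault:
    resource: Resource
    persistence: Persistence
    duration_s: Optional[float] = None  # None for permanent faults


def in_basic_fault_set(fault: Fault) -> bool:
    """Return True if the fault belongs to the basic fault set."""
    if fault.persistence is Persistence.PERMANENT:
        return True
    if fault.resource is Resource.CPU:
        return False  # only permanent CPU faults appear in the list
    return fault.duration_s is not None and fault.duration_s > MIN_DURATION_S
```

Grouping faults by the resource they take away, rather than by their physical cause, mirrors the idea of covering many faults with few mechanisms.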

The decomposition of this basic fault set into subclasses depends on the
next step in the paradigm, which is to partition the system into subsystems
based on system-level fault-tolerance and other requirements.
2.3.2.2 Dependability Goals. The fault-tolerant design of the EFTC
depends on its dependability goals. Since it is going to be used in an
environment where manual repair is possible, we felt that an availability
requirement was more important than a reliability requirement. Whereas
reliability defines the probability of correct service delivery for a specified
“mission time,” availability defines the long-term or steady-state
probability of correct service delivery. In other words, we felt it was more important
that the system be operational a large portion of time, even though it may
suffer infrequent outages due to failures. The availability requirement
defines the percentage of time the system must be operational.

For this effort, steady-state availability, A_ss, was defined as:

$$A_{ss} = \frac{MTTF}{MTTR + MTTF}$$

where MTTF = 1/λ and MTTR = 1/μ, and λ and μ represent constant
failure and repair rates, respectively. MTTF is defined as the Mean Time
to Failure, and MTTR the Mean Time to Repair. Given this definition,
A_ss was determined in the following way. Of all the faults in the basic fault
set (see Section 2.3.2.1), our experience with PC-AT-class machines in a
file-server configuration as depicted in Figure 1 is that hard-disk failures are
the dominant failure class, with an approximate MTTF of 5000 hours. Our
experience also is that after a hard-disk failure, the MTTR is anywhere from
2 to 8 hours with a mean of about 4 hours, depending on parts availability.
If spare disks are on hand, then the repair time is about 2 hours to replace
the disk and perform a restoration from backup volumes. Given an MTTF
of 5000 hours and an MTTR of 2 hours, A_ss for the baseline non-redundant
configuration is 5000/5002, or about 0.9996 (99.96%).

Our goal was to improve this availability considerably. Since a hard-disk
is a prepackaged unit, there is no way to improve its MTTF directly.
However, the effective MTTF of the disk can be improved via replication
of the hard-disk unit. For example, if a second (physically identical)
hard-disk is provided, and if it has a probability of failure independent
of the first hard-disk, then the effective MTTF is the sum of the MTTFs
of both hard-disks. This assumes, of course, that both hard-disks are
configured so that they represent a single logical hard-disk.

The second way to improve the availability is to reduce the MTTR. We
felt that it was possible to reduce the MTTR from two hours to at most
a few minutes through the judicious application of fault-tolerance. For
example, if the MTTF is reduced to 2500 hours (to be conservative) and
the MTTR to 5 minutes, then A_ss improves to 0.99997. On the basis of
this analysis, our goal for A_ss was set to > 0.9999. (An availability of
0.99997 corresponds to roughly 17.5 minutes of unavailability per calendar
year of continuous operation; the 0.9999 threshold corresponds to about
53 minutes.) An associated maintainability goal was to achieve an MTTR
of < 5 minutes.
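
As a cross-check on the arithmetic, the short Python sketch below evaluates the steady-state availability formula for the MTTF and MTTR values quoted in this section and converts each result into minutes of unavailability per year. The function names and the 8760-hour year are assumptions made for illustration; they are not taken from the EFTC analysis itself.

```python
HOURS_PER_YEAR = 8760.0  # assumes 365 days of continuous operation


def steady_state_availability(mttf_h: float, mttr_h: float) -> float:
    """Availability = MTTF / (MTTR + MTTF), with both times in hours."""
    return mttf_h / (mttr_h + mttf_h)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailability over one year of continuous operation."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60.0


# Baseline: MTTF of 5000 hours, MTTR of 2 hours.
baseline = steady_state_availability(5000.0, 2.0)

# Conservative fault-tolerant case: MTTF of 2500 hours, MTTR of 5 minutes.
improved = steady_state_availability(2500.0, 5.0 / 60.0)

for label, a in (("baseline", baseline), ("improved", improved), ("goal", 0.9999)):
    print(f"{label}: A = {a:.5f}, downtime = {downtime_minutes_per_year(a):.1f} min/year")
```

The comparison shows why the MTTR dominates: cutting the repair time from hours to minutes improves availability by roughly an order of magnitude even when the MTTF is halved.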

2. 3. 2. 3  Dependability  Evaluation.  The  design  paradigm  requires  that 
the  methods  used  to  evaluate  the  dependability  goals  specified  in  Sec¬ 
tion  2. 3. 2. 2  be  specified  at  this  point.  The  plan  for  meeting  the  main¬ 
tainability  goal — MTTR  <  5  minutes — was  relatively  simple:  we  simply 
measure  the  maximum  time  required  to  recover  from  all  faults  in  the  fault 
set.  E^•aIuating  at-ailability  is  more  difficult.  E\-aluating  a\-ailability  by 
measuring  the  ratio  of  uptime  to  total  time  is  infeasible  since  the  total  time 
must  be  very  long  so  that  a  statistically  meaningful  number  of  system  fail¬ 
ures  can  occur  (this  time  is  estimated  to  be  about  10000  hours — roughlv 
one  year  of  continuous  operation).  The  alternatives  were  simulation  mod¬ 
eling  and  analytic  modeling.  Our  choice  was  to  use  analytic  modeling  of 
the  final  hardware  configuration  since  the  time  to  develop  and  run  suitable 
simulation  models  is  beyond  the  scope  of  this  effort. 


2.3.3  Detection  and  Recovery  Algorithms 

The next step of the design paradigm requires that detection and recovery
algorithms be chosen. Although this is an iterative process that cannot
be completed until other steps of the paradigm (such as the fault set)
are complete, it was an opportunity to make some high-level decisions about
the way in which detection and recovery will be handled in the EFTC.

One design constraint was the need to reduce the amount of custom
hardware and software necessary to implement the EFTC, while retaining
compatibility with standards (de facto or otherwise) developed for the class
of machines chosen. While it is possible to implement fault-tolerance
anywhere from the system to the component level, the cost and complexity
increase the closer we get to the component level. Our first choice,
therefore, was to determine if the dependability requirements could be met by
applying fault-tolerance at the system level. If this is not possible, then the
next choice would be fault-tolerance at the level of individual subsystems,
and so on.

System-level fault-tolerance requires that replication be used at the
system level; i.e., the entire system is replicated. Error detection algorithms
then monitor and detect errors at the system boundary, and error recovery
involves the dynamic manipulation of entire systems. In the case of the
baseline system (see Figure 1), the “system” that is to be replicated is the
box labeled “File Server” on the diagram. This system is a 386-based
PC-compatible configured to operate as a file server using the Novell Netware
operating system. In our first attempt at applying fault-tolerance, this
system was duplicated in such a way that each duplicate can perform the
necessary file-server functions.

There were several choices of error detection algorithms for a duplexed
system configuration. One choice was to operate both systems concurrently,
loosely or tightly synchronized, and use a comparison algorithm to
determine when the two systems disagree; diagnostics could then be run on both
systems to determine which one was faulty. Another choice was to design
each system to be self-checking so that it is capable of detecting its own
faults and removing itself from the rest of the system at that time. Our
experience was that such self-checking systems require redundancy at the
very lowest levels of the system in order to reliably detect internal
faults; this is what we were trying to avoid from the outset. A third
approach is to treat one of the systems as the “active” system, and the other
as a “standby” system. The standby system does not normally perform
file server operations, but instead monitors the state of the active system.
An error condition in the active system detected by the standby system
results in activation of an error recovery procedure. Our choice of error
detection mechanism was the last one since it requires no, or relatively
little, custom hardware and software beyond what is already in the system.
In this configuration, the standby system monitors the state of the active
system using the built-in mechanisms of the Novell Netware operating
system and the ethernet hardware. The basis of the detection algorithm is
conceptually simple: any fault that results
in the inability of the file-server to provide a basic class of service (such
as accessing a file for read and/or write) results in invocation of an error
recovery algorithm. All faults in the basic fault set are “covered” by this
algorithm. The resulting fault-tolerant system configuration is depicted in
Figure 2.

Given the choice of error detection algorithm, the recovery algorithm was
chosen to effect recovery subject to the dependability requirements
(primarily the recovery latency). In our case, the choice was straightforward.
When an error is detected in the active system, the active system is
removed from the configuration, and the standby system is initialized and
brought online as the active file server. When the failed system is repaired,
it assumes the role of the standby system.
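
The active/standby arrangement and its recovery rule can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the EFTC implementation: `Server`, `DuplexPair`, and `basic_service_probe` are hypothetical names, and the probe is a stand-in for whatever check the standby actually performs (for example, a file read or write against the active file server).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True  # stands in for "can still provide basic file service"


def basic_service_probe(server: Server) -> bool:
    # Hypothetical stand-in for a basic service check such as a file read/write.
    return server.healthy


class DuplexPair:
    """Active/standby pair in which the standby watches the active system."""

    def __init__(self, active: Server, standby: Server,
                 probe: Callable[[Server], bool]) -> None:
        self.active = active
        self.standby: Optional[Server] = standby
        self.under_repair: Optional[Server] = None
        self.probe = probe

    def monitor_once(self) -> bool:
        """Run one detection cycle; return True if a switchover happened."""
        if self.probe(self.active):
            return False
        if self.standby is None:
            raise RuntimeError("active system failed with no standby available")
        # Remove the failed system and bring the standby online as active.
        self.under_repair = self.active
        self.active, self.standby = self.standby, None
        return True

    def complete_repair(self) -> None:
        """A repaired system rejoins the configuration in the standby role."""
        if self.under_repair is not None:
            self.under_repair.healthy = True
            self.standby, self.under_repair = self.under_repair, None
```

Note that the roles are symmetric: after a failure and repair, the original active system does not reclaim its role but waits as the new standby.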

The  detailed  design  and  implementation  of  these  algorithms  are  presented 
in  later  sections  of  this  report. 

2.3.4  System  Specification  Model 

Given a baseline fault-tolerant system configuration, we then constructed an
SSM to define the basic data and control flows necessary to achieve
fault-tolerant file-server functionality, and an appropriate detailed system
architecture. The resulting SSM comprises a System Requirements Model
(SRM) and a System Architecture Model (SAM). The relationship between
these models is depicted in Figure 3. Our development models were not
automated and, due to the relatively small amount of custom hardware and
software that was used in the brassboard design, only the essential parts of
the development methodology are included in this report.

2.3.4.1 Requirements Model. The requirements model abstractly
defines the hardware, software, and other requirements of the EFTC. Its
principal tools are flow diagrams: data flow diagrams (DFDs) and control
flow diagrams (CFDs). Figure 4 depicts the data context for the EFTC
via a data context diagram. The corresponding control flow context is
depicted in Figure 5. The data context diagram depicts the data relationship
between the EFTC and its surrounding environment. The circle in the
diagram represents a “process” (this one representing the EFTC system),
boxes represent static entities in the environment, and directed arcs
represent the flow of data. For example, workstations send data to and receive
data from the EFTC in the form of file-server requests and responses,
respectively; the labels on the arcs indicate the class of data flowing along that
arc. The data context diagram is the highest level of activity in the system;
it is the top level of a tree of DFDs, each of which provides successive levels
of refinement in data flow within the EFTC system. The control context
diagram is identical to the data context diagram except that the directed
arcs indicate the flow of control between the EFTC and its environment.
For example, workstations may send discrete control signals to the EFTC
to initiate or abort file server actions, or an operator may issue control
commands to the EFTC. Like the data context diagram, the control context
diagram is the top level of a tree of such diagrams, each of which provides
successive levels of refinement in control flow within the EFTC.

Figure 3: EFTC System Development Process (diagram labels: Specification
Model, Architecture Model, Requirements Model, Function Model, Control
Model, User Interface, Input Processing, Management Processing)

Figure 4: Data Context Diagram

Figure 5: Control Context Diagram

The first level of decomposition in the requirements model results in the
top-level DFD and CFD diagrams depicted in Figures 6 and 7, respectively.
The DFD, and the associated CFD, show three processes. Since we were
not modifying the basic system functions of the EFTC baseline software,
these are lumped into the single process named ‘System’. The file-server
functions were broken out into a separate process since it was necessary
to establish the data relationships between fault-tolerance functions ‘Fault
Management’ and file-server functions ‘File Server’. The CFD shows the
control relationship between these processes.

The second level of decomposition in the requirements model results in the
DFD and CFD diagrams depicted in Figures 8 through 11.
They show decompositions of process 1 ‘Fault Management’ and process 2
‘File Server’. Process 1 is by far the most important since it requires the
most customization. (In fact, all custom software in the EFTC belongs to
this process.)

2.3.4.2 Architecture Model. The architecture model abstractly
defines the configuration of physical modules that perform all the required
data and control processing. The requirements from the requirements
model were mapped into an architecture model taking all design constraints
into account. These constraints included all the requirements defined in the
Configuration Item Development Specification, FST document No.
FST91-281-1, CDRL 002.42.
Figure 8: DFD 1: Fault Management Process

Figure 9: CFD 1: Fault Management Process

• A local 80 Mbyte hard-disk drive and a 3.5 inch floppy disk drive
accessed via a disk controller card.

• A SCSI disk controller that accesses a pair of 80 Mbyte hard-disk
drives.

• An ethernet controller card that facilitates communication over the
external ethernet cable.

• A serial I/O interface card that provides several RS232-compatible
ports.

• A card that contains the CPU, RAM, and display-driver subsystems.

• A PC compatible keyboard.

• A color monitor.


The backup system has an extra RS232 I/O card that is used to provide
an error signal when an error is detected in the primary system.

The “Switch Box” is a custom-designed subsystem that provides physical
switching of RS232 I/O lines and SCSI disk I/O lines between the two
computers. It is independently powered from the computers.

3.2 Software

3.2.1  Software  System  Architecture 
Figure 11: CFD 2: File-Server Process


4.  Application  Studies 
5. System Enhancement Studies

In order to enhance the usability of the EFTC, a study was performed to
determine how the architecture could be adapted to perform as a file and
terminal server in a UNIX environment. This section discusses hardware
and software changes necessary to support a UNIX environment.

5.1  Hardware  Configuration 

We began with the baseline EFTC hardware depicted in Figure [?]. This
configuration was extended to include additional hardware needed to support a
typical UNIX environment. The revised baseline architecture is depicted in
Figure 13. The Interface Subsystem comprises a mirrored-disk subsystem
and an ethernet multiplexer-demultiplexer (EMD), as well as the physical
switching hardware. This switching hardware switches the RS-232
inputs/outputs of the EMD as well as the SCSI disk cable from one computer to
the other. Additional hardware required and not shown are EMDs at the
workstations; these are not part of the EFTC.

5.2  Software  Configuration 

The operating system selected for each PC compatible is UNIX SVR4 (AT&T
UNIX System V, Release 4) because of its position as a de facto industry
standard. Additionally, each PC compatible is configured with the
following software:

• A Fault-Tolerance System Manager (FTSM) module for system-level
error detection and error recovery.

• A SCSI driver.

• An ethernet driver.

• Network File System (NFS) manager. (Basic NFS is supplied as part
of the operating system.)

• Terminal I/O port Manager (TIM).

• Network Manager.
Table 1: EFTC Failure and Operating States

[Table body not recoverable from the source. Surviving entries: column
groups “Failure States” and “Operational States”; state rows “O O S A”,
“O O A S”, and “O O A A”; status labels “System failed” and “System OK”.]
5.3 Basic Operating Principles

The EFTC would achieve fault-tolerance through system-level error
detection and spare-switching. One PC compatible is designated as “active,”
and the other as a “spare.” Table 1 depicts the failure and operational
states of the system, where ‘F’, ‘O’, ‘A’, and ‘S’ mean failed, operational,
active, and standby, respectively. When a fault is detected in the active
system by the spare system, spare switching occurs and the spare system
becomes the active system; more details of this operation are provided in
the following paragraphs.
5.3.1 Initialization

.4t  svstem  startup,  both  systems  boot  U>'’L\  from  their  local  hard-disks. 
.4iter  a  time  period  sufhcient  for  both  systems  to  complete  their  boot  pro¬ 
cesses.  or  upon  operator  intervention  from  e;ther  system,  an  :nitiaI:zation 
dialogue  commences  between  the  two  FTSMs  to  determine  which  system 
■will  be  active.  The  cci:viiy  protocol  is  designed  in  such  a  way  that,  if  both 
systems  boot  correctly,  system  W  tviU  become  the  active  system;  other-s^dse. 
the  system  that  boots  correctly  s\-ill  become  active.  Once  ‘"mastership"  has 
been  estabhshed,  system  operations  may  begin. 
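
The outcome of the activity protocol can be illustrated with a minimal sketch. It models only the decision rule stated above, not the dialogue between the two FTSMs. The function name, the boot-status dictionary, and the name "X" for the second system are assumptions made for illustration; the report names only system W.

```python
from typing import Dict, Optional

PREFERRED = "W"  # the report names system W as the default active system


def elect_active(booted: Dict[str, bool]) -> Optional[str]:
    """Return the name of the system that should become active, or None.

    `booted` maps a system name to whether it completed its boot process.
    The second system's name ("X" below) is hypothetical.
    """
    healthy = [name for name, ok in booted.items() if ok]
    if not healthy:
        return None  # neither system booted; mastership cannot be established
    if PREFERRED in healthy:
        return PREFERRED  # both booted, or only W booted
    return healthy[0]  # only the other system booted


print(elect_active({"W": True, "X": True}))   # W
print(elect_active({"W": False, "X": True}))  # X
```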

5.3.2  Error  Detection 

The  EFTC  uses  system-level  error  detection;  i.e.,  the  spare  system  is  used 
to  detect  errors  in  the  active  system.  More  precisely,  the  FTSM  in  the 
spare  system  communicates  and  coordinates  with  the  FTSM  in  the  active 
system  to  detect  errors  and  perform  recovery  actions. 

Errors in the active system are detected using the normal error detection
mechanisms of the system; i.e., the errors detected by the hardware and/or
the errors detected by the system software. Each error is assigned an error
class. The FTSM in the spare system monitors these errors and, based on
their severity, initiates system recovery actions. The FTSM in the spare
system also detects an error by omission; i.e., if the active system fails to
respond to periodic queries from the FTSM in the spare system, the active
system is deemed to have failed.
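
The two detection paths described above (severity of reported errors, and omission of replies) can be sketched as follows. This is an illustration under stated assumptions, not the FTSM implementation: the report does not enumerate its error classes, so the integer severity scale, the threshold, the timeout value, and all names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical severity scale: classes at or above this value trigger recovery.
RECOVERY_THRESHOLD = 2


@dataclass
class SpareMonitor:
    query_timeout: float     # seconds without a reply before omission is declared
    last_reply: float = 0.0  # time of the most recent reply from the active system
    reported: List[int] = field(default_factory=list)  # error classes seen so far

    def record_reply(self, now: float) -> None:
        """The active system answered a periodic query."""
        self.last_reply = now

    def record_error(self, error_class: int) -> None:
        """The active system's hardware or software reported an error."""
        self.reported.append(error_class)

    def should_recover(self, now: float) -> bool:
        """True if a severe error was reported or replies have stopped."""
        severe = any(c >= RECOVERY_THRESHOLD for c in self.reported)
        omitted = (now - self.last_reply) > self.query_timeout
        return severe or omitted


monitor = SpareMonitor(query_timeout=5.0)
monitor.record_reply(now=10.0)
print(monitor.should_recover(now=12.0))  # False: reply is recent, no errors
print(monitor.should_recover(now=20.0))  # True: no reply for 10 s (omission)
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.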

5.3.3  Error  Recovery 

The EFTC uses system-level recovery. The standby system will, as part of
the recovery actions, assume mastership and become the active system. The
terminals and external hard-disks will be physically switched to the new
active system, terminal ports and file systems will be logically connected
to the active system, and system operations resumed.

In order to support Level 2 transparency in file server operations (Level 2
transparency means that users will observe a system outage while error
recovery takes place; they are not able to continue the current session, but
there will be no loss of data due to the failure), file server "transactions"
are monitored by the spare system (before the failure of the active system).
Upon spare-switching, any transactions that had not been committed by the
failed system will be redone by the newly active system. Terminal ports
are simply reconnected to the active system; users are required to log in again.
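
The redo step can be sketched as a replay of the monitored transaction log. The report does not describe the log format or how commits are recorded, so the `Transaction` fields, the dictionary standing in for the shared disk, and the function name are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    txn_id: int
    path: str
    data: str
    committed: bool = False


def redo_uncommitted(log: List[Transaction], files: Dict[str, str]) -> List[int]:
    """Reapply, in log order, every transaction the failed system never committed.

    Returns the ids of the transactions that were redone.
    """
    redone = []
    for txn in log:
        if not txn.committed:
            files[txn.path] = txn.data  # redo the write on the newly active system
            txn.committed = True
            redone.append(txn.txn_id)
    return redone


files = {"a.txt": "v1"}  # hypothetical state of the shared disk after the failure
log = [
    Transaction(1, "a.txt", "v1", committed=True),
    Transaction(2, "b.txt", "v2"),  # in flight when the active system failed
]
print(redo_uncommitted(log, files))  # [2]
print(files)                         # {'a.txt': 'v1', 'b.txt': 'v2'}
```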

5.3.4  File  Server  Operation 

Clients external to the EFTC system may access and use its file server
functionality. File server functions are accessed via remote file server commands
across the Ethernet. The EFTC currently implements the NFS protocol for
remote file operations. (Other file server protocols can be optionally
supported.)

5.3.5  Terminal  Server  Operation 

Users external to the EFTC system may log in remotely via the terminal
server functionality. The EFTC provides a number of serial ports for this
purpose. During system recovery, all serial ports are physically switched
from one side of the system to the other; any users currently logged in
via one of these serial ports will have their sessions terminated and must
log in again when recovery is completed.

All of the systems tested by Fail-Safe Technology (FST) were configured on pairs of
personal computers (two computers in one box). Initial tests were run on the Diversified
Technology single box PC-compatible computer system. Netware 286 network software
was installed and failure modes tested. A Netware 286 system was configured on Unisys
computers (supplied by NASA) and returned to them for evaluation.

Because NASA's procurement activities were operating on a large multi-server network with
hundreds of users, their test results were as valid and reliable as the in-house tests
performed at Fail-Safe Technology. The NASA tests were conducted over an 8-month
period to ensure comprehensiveness. Fail-Safe Technology then installed Netware
386 network software on an IBM compatible "clone" style computer tower 486 system.

The IBM clone system was never made to work reliably because of gaps in true
IBM compatibility, which are evident in some brands sold in the marketplace. FST installed the
same software on an NCR desktop computer system. This system worked well and showed
problem-free operation when switched without open files.

6.1.  Tests  with  Diversified  Technology  Single  Box  PC-compatible  Computer  System 

Initial tests were run on this equipment with DOS programs. FST wrote an
operating system add-on program that transferred keystrokes from each
PC into a keystroke buffer of the second machine. Hardware was designed,
developed and fabricated to feed an RS232 serial communications interface
input into both machines within the single box and to select the output of the active
primary computer. This design had limited success. Some off-the-shelf programs
(such as "Windows") did not use the keyboard buffer and therefore failed to function
with this technique. Other programs (e.g., PC Base IV) would omit occasional
keystrokes and go out of synchronization. Primary market research
conducted on likely applications found interest from only a limited
number of prospective customers for fault-tolerant DOS applications. There were
few DOS applications identified in the market for critical operational functions.
The few prospects that were interested in fault-tolerant DOS (such as security
monitoring applications) had implemented special programs (custom software).
This is logical.

6.2.  NASA  System  (Unisys  computers  with  Novell  286  network  software) 

Efforts were concentrated on creating a fault-tolerant system that would operate on
off-the-shelf Novell network programs. FST installed Novell 286 on a hardware
system consisting of two Unisys computers that NASA-Houston sent to FST and that
were then returned to NASA for installation and test operations.

FST created a software program to monitor the primary network file server from a
backup secondary machine. When the backup was unable to access a file on the
primary, it switched the primary SCSI disk to itself and rebooted as the Novell file
server. The system performed without problems in FST's laboratory and was
shipped to NASA for installation and operation in a real-world environment as a
Beta test site. NASA encountered the following problems over time while the
system was in use:

a. The backup would sometimes log into and monitor another server rather
than its own primary. This was due to the enormous number of alternate
servers operating on the system. This was corrected by FST with a software
modification and enhancement.

b. The backup would switch over when long (size and time) file transfers were
tying up the network. FST resolved this problem with a software change
permitting a lengthier time for the backup to receive the correct information
from the primary computer.

c. The backup would sometimes not complete its booting without operator
input if files were open when the switch occurred. This problem was
partially resolved but requires additional software enhancements to improve
the reliability in this situation. Novell's program must be modified to
accomplish this.
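
Problems a and b both concern the backup's decision to switch over, and the shape of the two fixes can be sketched as below. This is a hypothetical model, not FST's monitoring program: the report does not say how the backup identified its primary or what time limits it used, so `ProbeResult`, the server names, and the timeout values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    server_name: Optional[str]  # server that answered the probe, None if no answer
    elapsed: float              # seconds the probe took


def should_switch(probe: ProbeResult, primary_name: str, timeout: float) -> bool:
    """Decide whether the backup should take over as file server."""
    if probe.server_name is not None and probe.server_name != primary_name:
        # Problem a: an answer from another server says nothing about our primary.
        raise ValueError(
            "probe reached %r, expected %r" % (probe.server_name, primary_name)
        )
    answered_in_time = probe.server_name is not None and probe.elapsed <= timeout
    return not answered_in_time


# Problem b: a reply delayed by a long file transfer (hypothetical 20 s).
slow = ProbeResult(server_name="PRIMARY", elapsed=20.0)
print(should_switch(slow, "PRIMARY", timeout=5.0))   # True: short limit forces a switch
print(should_switch(slow, "PRIMARY", timeout=30.0))  # False: longer limit tolerates it
```

A longer limit trades slower detection of real failures for fewer spurious switchovers, which is the trade-off the fix for problem b accepts.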

6.3 Tests with Other Hardware

FST installed Netware 286 network software on both of the Diversified Systems
computers with hardware designed and fabricated by FST to troubleshoot the
problems identified by NASA as shown above. FST solved problems a and b with
software modifications and enhancements. Problem c was not totally solved.
Coordination with Novell will be required to alter their software slightly and
accommodate resolution of this problem. It was decided to convert and upgrade the
system to Netware 386, which should not exhibit this problem. Netware 386
would probably be demanded by all future customers anyway.

FST installed Netware 386 on two IBM compatible "clone" computer towers with
switching hardware and software designed, developed and fabricated by FST. Data
was corrupted on almost every simulated failure during testing. The clones were
returned to the vendors as a result. FST then installed Netware 386 on NCR 3445
computers for further testing. The switchover worked as well on the NCR computers
as it had with Netware 286 on the Diversified System.

6.4  Conclusion 

The FST fault-tolerant adapter hardware performed very well on the off-the-shelf
microcomputers most accepted in the marketplace, converting them to fault-tolerant
systems. Problems were experienced with a single "no name" brand clone, which was
discovered not to be fully IBM compatible due to design nuances in the hardware.

FST's hardware has been tested to the point where it is considered to be ready for
production.